Data

The data set used for this project was collected from 422 adult patients who underwent liver transplant surgery between 2009-2019 at Vanderbilt University Medical Center. The original dataset includes 140 variables including demographic information, pre-transplant and post-transplant comorbidities, surgical variables, and outcome information including days in various statuses, infections and mortality. Additionally, pre-transplant and post-transplant follow up CT scans were taken and analyzed for measures of body composition. Some patients have follow-up CT scans for up to 10 years, but many have only a few years of follow-up. It is important to note that the dates of liver transplants span from 01/09/2009 – 12/23/2018 while follow-up data was continued to be collected until summer 2021.

# Loading the Data
data <- box_read("942618589040")

Data Cleaning and Processing

Currently each observation is for a unique liver transplant patient. To model state transition probabilities for each patient overtime, the data needs to be processed so that each observations is for one day for each patient and their state on that given day.

Additionally, there is some cleaning/grouping of categorical variables that needs to perform before diving in to the analysis. Here is the first 5 rows for all observations to get an idea of what the data looks like.

paged_table(head(data))

Grouping Categories

There are some categorical variables in which certain categories have a very small number of patients. Additionally, some categories are very similar in nature. In these instances, similar and small categories can be grouped in ways that are useful for modeling without losing too much important information.

Location of Discharge

There are currently 5 categories for location of discharge.

  1. home
  2. SNF (Skilled nursing facility)
  3. IPR (In-patient Rehabilitation)
  4. LTAC (long term acute care)
  5. home PT (physical therapy)
table(data$locationof_d_chome0snf1ipr2ltac3home_pt4) %>% 
  kbl(col.names = c("Location of Discharge", "Frequency"), align = 'c') %>% kable_minimal
Location of Discharge Frequency
0 211
1 2
2 87
3 3
4 115

As you can see above, home and home with PT are very similar and hard to distinguish so these can be grouped. The other 3 groups all have smaller amounts of patients and are similar types of facilities that a patient could end up in after being discharged from the hospital so these can be grouped into a ‘long term care’ category.

data <- data %>% mutate(location_dc = case_when(
  locationof_d_chome0snf1ipr2ltac3home_pt4 == 0 ~ 0,
  locationof_d_chome0snf1ipr2ltac3home_pt4 == 1 ~ 1,
  locationof_d_chome0snf1ipr2ltac3home_pt4 == 2 ~ 1,
  locationof_d_chome0snf1ipr2ltac3home_pt4 == 3 ~ 1,
  locationof_d_chome0snf1ipr2ltac3home_pt4 == 4 ~ 0
))

table(data$location_dc) %>% 
  kbl(col.names = c("New Location of Discharge", "Frequency"), align = 'c') %>% kable_minimal
New Location of Discharge Frequency
0 326
1 92

Now locationDC is a new variable with the following groups:

  1. home
  2. long term care

Biliary Complications

There is a not big difference between different types of biliary complications and some of the categories have very small numbers as shown in the table below.

table(data$biliarycomplications_21bileleak2biloma3stricture4other) %>% 
  kbl(col.names = c("Biliary Complications", "Frequency"), align = 'c') %>% kable_minimal
Biliary Complications Frequency
1 43
2 5
3 73
4 15

Currently the values for this variable are

  1. bile duct leak
  2. biloma
  3. stricture
  4. other

This can be transformed into a yes/no variable as to whether the patient had any biliary complications or not.

data <- data %>% mutate(biliarycomplication = case_when(
  is.na(biliarycomplications_21bileleak2biloma3stricture4other) ~ 0,
  biliarycomplications_21bileleak2biloma3stricture4other > 0 ~ 1
))

table(data$biliarycomplication) %>% 
  kbl(col.names = c("New Biliary Complications", "Frequency"), align = 'c') %>% kable_minimal
New Biliary Complications Frequency
0 286
1 136

Now biliarycomplication is a new variable with the following groups: 0. No 1. Yes

Pre and Post Condition Variables

There are quite a few variables that indicate whether a patient had a condition (yes or no) before their liver transplant and then again whether they had the condition after their transplant. I do not have any information on what pre and post mean in this context such as when these diagnoses happened. Therefore, what is more interesting is whether a patient has had a condition at all. Therefore, these 8 binary variables can be transformed into 4 indicator variables of whether the patient has had the condition at some point regardless of whether is was pre or post liver transplant surgery. This also is helpful for modeling as some of these indicator variables have low values and grouping them can resolve this issue.

data <- data %>% mutate(copd = case_when(
  (copd_pre == 0 & copd_post == 0) ~ 0,
  (copd_pre == 1 & copd_post == 1) ~ 1,
  (copd_pre == 0 & copd_post == 1) ~ 1,
  (copd_pre == 1 & copd_post == 0) ~ 1
),
htn = case_when(
  (htn_pre == 0 & htn_post == 0) ~ 0,
  (htn_pre == 1 & htn_post == 1) ~ 1,
  (htn_pre == 0 & htn_post == 1) ~ 1,
  (htn_pre == 1 & htn_post == 0) ~ 1
),
dm = case_when(
  (dm_pre == 0 & dm_post == 0) ~ 0,
  (dm_pre == 1 & dm_post == 1) ~ 1,
  (dm_pre == 0 & dm_post == 1) ~ 1,
  (dm_pre == 1 & dm_post == 0) ~ 1
),
cad = case_when(
  (cad_pre == 0 & cad_post == 0) ~ 0,
  (cad_pre == 1 & cad_post == 1) ~ 1,
  (cad_pre == 0 & cad_post == 1) ~ 1,
  (cad_pre == 1 & cad_post == 0) ~ 1
),
ckd = case_when(
  (ckd_pre == 0 & ckd_post == 0) ~ 0,
  (ckd_pre == 1 & ckd_post == 1) ~ 1,
  (ckd_pre == 0 & ckd_post == 1) ~ 1,
  (ckd_pre == 1 & ckd_post == 0) ~ 1
),
ctd = case_when(
  (ctd_pre == 0 & ctd_post == 0) ~ 0,
  (ctd_pre == 1 & ctd_post == 1) ~ 1,
  (ctd_pre == 0 & ctd_post == 1) ~ 1,
  (ctd_pre == 1 & ctd_post == 0) ~ 1
)
)

Here for each of these new variables 0. no condition pre or post surgery 1. condition either or both pre and post surgery

data %>% select(copd, htn, dm, cad, ckd, ctd) %>% head() %>% kbl() %>% kable_minimal
copd htn dm cad ckd ctd
0 1 1 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 1 0 1 0
0 1 1 0 0 0
0 0 0 0 1 0

Transition States

The states that patients can transition from post liver transplant from best to worst are

  1. Home/Long Term Care
  2. Hospital
  3. ICU/Ventilator
  4. Death

The typical order a patient will go through post liver transplant are in the ICU and/or on a ventilator, in the hospital, at an inpatient long term facility or at home, and then finally death. This is the most common transition overtime, however, not all patients will follow this trend, may skip states and may never reach certain states.

Ideally, this analysis would be improved with being able to distinguish between a patient being at home or at a long term facility, however, unfortunately this level of information was not provided accurately in the data.

Currently, the data has the number of days in each of these states or until these states, but does not show patients moving through these different states overtime. It needs to be processed to obtain a variable for states and time. Here we can see the data provided with days in each state.

data %>% select(iculos, index_los, days_until_death, dayson_ventpost_lt) %>% head()

Check for Missing State Information

To create the outcome states, there must be no missing observations for any of the state variables.

data %>% select(iculos, dayson_ventpost_lt,index_los) %>% gg_miss_var(show_pct=TRUE)

Here we can see there is one missing observation for iculos and one missing observation for dayson_ventpost_lt. These were imputed using multiple imputation and predictive mean matching with 5 iterations.

set.seed(5225)

a <- aregImpute(~age + height + weight + ascites_yn + he + race + sex + smoker + cit +
                  etiology + meld + acr + newmalignancy + surgeryduration +
    readmissionwithin90d + infection_90 + need_repeat_surgery +
    vascularcomplicationswithin90days + sma00 + vat00 + sat00 + s_mhu00 + v_fhu00 +
      sa_thu00 + biliarycomplication + index_los + days_until_death + iculos + location_dc 
    + I(dayson_ventpost_lt) + I(htn) + I(dm) + I(ckd) + copd + cad + ctd,
    data = data, n.impute = 5, x = TRUE)

imputed_data <- as.data.frame(a[['x']]) %>% select(iculos, dayson_ventpost_lt)

#original data with imputed values
data <- cbind(data %>% select(-iculos, -dayson_ventpost_lt), imputed_data)

Create Days of each State

All patients start in the ICU and or on a ventilator post liver transplant. The maximum of these two values is used to determine the days in the state ICU/Ventilator. Then patients will transition to the regular hospital ward for the remainder of index_los days (total length of hospital stay).

data$icu_vent <- pmax(data$dayson_ventpost_lt, data$iculos)

data['days_hospital'] <- ifelse(data$index_los - data$icu_vent >= 0,
                                data$index_los - data$icu_vent, 0)

There is no current indicator for days at home or days in a rehab facility. As there is no readmission data or any further dates of changes, it must be assumed that once a patient is discharged to home or an inpatient facility, that the rest of the days should be home or inpatient facility (not yet accounting for deaths). Days at home for patients are only created until the maximum date for the longest stay in the hospital in the data set plus 30 days (this is a total of 95 days). Although there is death information for up to 10 years for some patients, the analysis will not go beyond a few months. Additionally, there is no information on the cause of death, therefore the further post liver transplant a patient dies, the less is known about how much the liver transplant itself factors in to the patient’s death.

#find total days in hospital
data <- data %>% mutate(total_days = icu_vent + days_hospital)

#find the maximum number of days of hospital stay in data
days <- data %>% select(total_days)
max_day <- max(days[!is.na(days),]) + 30

#create home/facility days
data <- data %>% mutate(days_home_facility = max_day - total_days)

Create rows for each day

Now that the number of days each patient was in each state has been created, a row for each day for each patient can be derived with the corresponding state for that day.

The following code creates a function to replicate as many rows as there are days for a given state.

days_per_state <- function(df, column){
  new_df <- data.frame(matrix(ncol = length(data) + 1, nrow = 0))
  colnames(new_df) <- c(colnames(data),'state')
  
  for (i in 1:nrow(df)){
    num_rows <- df[i,column]
    if (num_rows == 0){
      invisible()
    } else{
      for (j in 1:num_rows){
        d <- df[i,]
        d['state'] <- column
        new_df <- rbind(new_df, d)
      }
    }
  }
  return(new_df)
}

The next code chunk runs the new function on each of the days in a state columns to create additional rows for each patient for each observation day over the whole time period.

processed_data <- data.frame(matrix(ncol = length(data) + 1, nrow = 0))
colnames(processed_data) <- c(colnames(data),'state')

columns <- c('icu_vent', 'days_hospital','days_home_facility')
for (column in columns){
  new_data <- days_per_state(data, column)
  processed_data <- rbind(processed_data, new_data)
}

A variable to indicate time (in days) also needs to be added to the data. For this variable each observation is one day and there should be the same number of days for each patient (95 days in total).

processed_data <- ddply(processed_data, .(patient_mrn), transform, 
                        time = seq_along(patient_mrn)-1)

Factoring in death

The only state that has not yet been accounted for is death. If a patient dies during the time period, their state needs to be changed to death on the day they died and then all states after that should be converted to death because death is an absorbing state.

patients_died <- processed_data %>% filter(days_until_death <= max_day) %>%
                 select(patient_mrn) %>% unique()

for (patient in patients_died$patient_mrn){
  df_d <- processed_data %>% filter(patient_mrn == patient)
  death_date <- as.numeric(df_d$days_until_death[1])
  processed_data[(processed_data$patient_mrn == patient),][death_date:max_day,'state'] <- 'dead'
}

Thee states can be encoded as numbers like the following for ease of use throughout the rest if the analysis:

  1. Home/Inpatient Rehab
  2. Hospital
  3. ICU/Ventilator
  4. Death
processed_data <- processed_data %>% mutate(state = case_when(
                        state == "days_home_facility" ~ 1,
                        state == "days_hospital" ~ 2,
                        state == "icu_vent" ~ 3,
                        state == "dead" ~ 4
))

Previous State

Now that the state outcome variable has been created, it can be used to create the previous state. This variable is important in a first order Markov state transition model in order to condition the predicted outcome on the previous state.

processed_data <- processed_data %>% 
                  mutate(prev_state = ifelse(time==0,0, lag(state)))

processed_data <- processed_data %>% select(-days_until_death, -index_los, -iculos,
                                            -days_hospital, -days_home_facility, -total_days,
                                            -icu_vent, -dayson_ventpost_lt)
processed_data$state <- factor(processed_data$state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

processed_data$prev_state <- factor(processed_data$prev_state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

Here we can see a table of the state and previous state variables.

table("State" = processed_data$state, "Previous State" = processed_data$prev_state) %>% kbl(align = 'c') %>% kable_minimal
Home/IPR Hospital ICU/Vent Dead
Home/IPR 33717 402 16 0
Hospital 0 3122 404 0
ICU/Vent 0 0 1483 0
Dead 5 2 2 515

Variable Selection

There are many variables that can be remove based on them being intermediary variables to clean other variables, domain knowledge, cleanliness, or due them being outcomes. These can be filtered out immediately.

These variables include longitudinal body composition measures from annual check ups, bmi (calculated based on height and weight which are already in the data), etc.

predictors <- processed_data %>% select(age, height, weight, ascites_yn,
                he, race, sex, smoker, cit, etiology, meld, acr, surgeryduration,
                sma00, smi00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00,
                copd, htn, dm, cad, ckd, ctd, time, prev_state)

Here is a list of the remaining variables and their data types.

print(str(predictors))
## 'data.frame':    40090 obs. of  28 variables:
##  $ age            : int  57 57 57 57 57 57 57 57 57 57 ...
##  $ height         : num  163 163 163 163 163 ...
##  $ weight         : num  80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 ...
##  $ ascites_yn     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ he             : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ race           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sex            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoker         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cit            : num  9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 ...
##  $ etiology       : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ meld           : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ acr            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ surgeryduration: int  355 355 355 355 355 355 355 355 355 355 ...
##  $ sma00          : num  138 138 138 138 138 ...
##  $ smi00          : num  52.1 52.1 52.1 52.1 52.1 ...
##  $ vat00          : num  182 182 182 182 182 ...
##  $ sat00          : num  289 289 289 289 289 ...
##  $ s_mhu00        : num  14.1 14.1 14.1 14.1 14.1 ...
##  $ v_fhu00        : num  -86.2 -86.2 -86.2 -86.2 -86.2 ...
##  $ sa_thu00       : num  -85.5 -85.5 -85.5 -85.5 -85.5 ...
##  $ copd           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ htn            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ dm             : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ cad            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ckd            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ctd            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ time           : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ prev_state     : Factor w/ 4 levels "Home/IPR","Hospital",..: NA 3 3 3 3 3 3 3 2 2 ...
## NULL

Some of these possible predictors need to be converted to factors using the code below.

convert <- c('ascites_yn', 'he', 'race', 'sex', 'smoker', 'etiology', 'acr', 'copd', 
             'htn', 'dm', 'cad', 'ckd', 'ctd', 'prev_state')

for (var in convert){
  predictors[var] <- as.factor(predictors[[var]])
}

Finally, we can get a better understanding of these possible predictors by looking at summary statistics and distributions for each one.

html(describe(predictors))
predictors

28 Variables   40090 Observations

age
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
400900480.99955.6810.6936435157636768
lowest : 16 18 19 26 31 , highest: 71 72 73 74 77
height
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
400900630.997172.111.72154.9157.5165.1172.7180.3185.4188.0
lowest : 142.240 147.320 149.860 150.000 150.114 , highest: 193.040 195.580 198.100 200.660 203.200
weight
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
400900347186.8223.73 56.40 60.78 71.00 85.00 98.88114.39124.51
lowest : 31.30 40.00 45.00 45.45 46.30 , highest: 147.00 147.40 150.00 150.22 179.02
ascites_yn
nmissingdistinct
4009002
 Value          0     1
 Frequency   7695 32395
 Proportion 0.192 0.808
 

he
nmissingdistinct
4009002
 Value          0     1
 Frequency   9310 30780
 Proportion 0.232 0.768
 

race
nmissingdistinct
4009002
 Value          0     1
 Frequency  36860  3230
 Proportion 0.919 0.081
 

sex
nmissingdistinct
4009002
 Value          0     1
 Frequency  25270 14820
 Proportion  0.63  0.37
 

smoker
nmissingdistinct
4009002
 Value          0     1
 Frequency  35625  4465
 Proportion 0.889 0.111
 

cit
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
399959530015.7412.6222.483.123.985.387.108.889.88
lowest : 0.32 0.55 0.65 1.81 2.00 , highest: 13.22 13.44 16.37 16.63 22.08
etiology
image
nmissingdistinct
4009005
lowest : 0 1 2 3 4 , highest: 0 1 2 3 4
 Value          0     1     2     3     4
 Frequency  10925  9880 11970  3800  3515
 Proportion 0.273 0.246 0.299 0.095 0.088
 

meld
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3999595410.99922.1410.43 8101522283438
lowest : 6 7 8 9 10 , highest: 42 43 48 49 51
acr
nmissingdistinct
4009002
 Value          0     1
 Frequency  36670  3420
 Proportion 0.915 0.085
 

surgeryduration
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
39995952041312.978.5217233258305353400456
lowest : 166 168 180 183 186 , highest: 542 545 562 591 644
sma00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
399001904201164.244.18108.2117.6135.0159.5189.6212.5231.4
lowest : 71.2681 75.2116 81.8374 81.9047 82.8039 , highest: 283.8478 288.2231 312.0324 313.5865 333.0246
smi00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
382851805403155.4713.1440.1142.7047.1353.6362.2070.4677.18
lowest : 20.72918 28.46148 29.14429 29.93326 33.68552
highest: 93.29691 94.08640 97.15067 98.00957111.65846

vat00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
398052854191115.289.39 14.06 25.08 53.44 98.32158.58220.54268.63
lowest : 2.4843 2.4943 2.7736 3.1662 3.3079 , highest: 366.8305 367.2028 371.1341 434.9041 736.1889
sat00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
398052854191232.9143.4 48.29 80.15135.11211.87322.35407.34469.77
lowest : 12.0312 13.8672 19.0734 21.1624 24.3164 , highest: 638.1798 647.5449 649.2393 657.9833 658.2355
s_mhu00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
39805285419129.3110.5514.1816.9222.4429.5135.3841.9645.23
lowest : 7.8584 8.3023 8.4103 8.9505 9.2501 , highest: 48.5764 48.6109 48.8868 52.0062 56.7703
v_fhu00
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
398052854191-79.599.381-95.84-90.72-85.14-78.45-73.68-69.72-67.91
lowest :-107.1590-105.7500-104.2460-103.8350-102.6440
highest: -62.7276 -62.1905 -62.1478 -61.3434 -59.4222

sa_thu00
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
    39805      285      419        1   -82.99    16.28  -103.63  -100.24   -94.16 
      .50      .75      .90      .95 
   -84.92   -73.43   -62.50   -55.56 
 
lowest :-114.3210-111.9750-109.0650-108.3390-108.2930
highest: -49.4904 -49.3095 -48.6547 -46.9401 -45.9087

copd
nmissingdistinct
39995952
 Value          0     1
 Frequency  37145  2850
 Proportion 0.929 0.071
 

htn
nmissingdistinct
39995952
 Value          0     1
 Frequency  23370 16625
 Proportion 0.584 0.416
 

dm
nmissingdistinct
39995952
 Value          0     1
 Frequency  25270 14725
 Proportion 0.632 0.368
 

cad
nmissingdistinct
39995952
 Value          0     1
 Frequency  28690 11305
 Proportion 0.717 0.283
 

ckd
nmissingdistinct
39995952
 Value          0     1
 Frequency  26220 13775
 Proportion 0.656 0.344
 

ctd
nmissingdistinct
4009002
 Value          0     1
 Frequency  39615   475
 Proportion 0.988 0.012
 

time
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
4009009514731.66 4 92347718590
lowest : 0 1 2 3 4 , highest: 90 91 92 93 94
prev_state
image
nmissingdistinct
396684224
 Value      Home/IPR Hospital ICU/Vent     Dead
 Frequency     33722     3526     1905      515
 Proportion    0.850    0.089    0.048    0.013
 

Missingness

The first method for variable selection is to look at the percentage of missing values in each of the possible features. Imputation can be used, but if any variables are missing a very large portion of data, they might not be useful in a model.

gg_miss_var(predictors, show_pct=TRUE)

Luckily, there are not very many observations missing for any of these variables. The one missing the most is smi00 but it is only missing ~4.5% of observation which can easily be imputed.

Redundancy Analysis

Now we can see if any variables are redundant, meaning they can easily be explained by other variables. If a variable is easily explained by the other possible predictors, it would be redundant to use it in a model, especially when there are many possible predictors to use for modeling.

r <- redun(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), type = 'adjusted', data=predictors, r2 = 0.8)

r['rsq1'] %>% kbl(col.names = "Adjusted $R^2$")
Adjusted \(R^2\)
age 0.3166978
height 0.9932276
weight 0.6996133
ascites_yn 0.3647461
he 0.2705152
race 0.1575267
sex 0.6999512
smoker 0.1736233
cit 0.1289965
etiology 0.3410604
meld 0.4865624
acr 0.1129448
surgeryduration 0.1578348
sma00 0.9905794
smi00 0.9902820
vat00 0.6847470
sat00 0.5996956
s_mhu00 0.5002715
v_fhu00 0.7599690
sa_thu00 0.8111483
copd 0.1419144
htn 0.2677915
dm 0.2637333
cad 0.2006321
ckd 0.2388503
ctd 0.1173030
time 0.2890093
prev_state 0.4744409
#r['rsquared']  %>% kbl(col.names = "Adjusted $R^2$")

With an adjusted R^2 of 0.99 the variance in height is easily explained by the other variables. This actually makes sense due to the fact that \(smi00 = sma00/height\).

redun(~ smi00 + height + sma00, type = 'adjusted', data=predictors, r2 = 0.8)
## 
## Redundancy Analysis
## 
## redun(formula = ~smi00 + height + sma00, data = predictors, r2 = 0.8, 
##     type = "adjusted")
## 
## n: 38285     p: 3    nk: 3 
## 
## Number of NAs:    1805 
## Frequencies of Missing Values Due to Each Variable
##  smi00 height  sma00 
##   1805      0    190 
## 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.8    Type: adjusted 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##  smi00 height  sma00 
##  0.989  0.992  0.989 
## 
## Rendundant variables:
## 
## height
## 
## Predicted from variables:
## 
## smi00 sma00 
## 
##   Variable Deleted   R^2 R^2 after later deletions
## 1           height 0.992

It’s better to keep the original measurements in the model rather than derived measurements. Because smi00 (skeletal muscle mass index) is based on height and sma00, it can be removed.

predictors <- predictors %>% select(-smi00)

Now we can perform the redundancy analysis again to determine whether any other variables are redundant with smi00 removed.

r2 <- redun(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), type = 'adjusted', data=predictors, r2 = .8)

r2['rsq1'] %>% kbl(col.names = "Adjusted $R^2$")
Adjusted \(R^2\)
age 0.3304891
height 0.6078744
weight 0.7056083
ascites_yn 0.3500264
he 0.2717396
race 0.1540163
sex 0.6913783
smoker 0.1717346
cit 0.1226301
etiology 0.3355259
meld 0.4782940
acr 0.0987534
surgeryduration 0.1638818
sma00 0.6074320
vat00 0.6857524
sat00 0.6061284
s_mhu00 0.5120839
v_fhu00 0.7596956
sa_thu00 0.8095394
copd 0.1314401
htn 0.2660341
dm 0.2614404
cad 0.1983304
ckd 0.2189785
ctd 0.1058948
time 0.2860319
prev_state 0.4715145
#r2['rsquared']  %>% kbl(col.names = "Adjusted $R^2$")

There are still some body composition variables with pretty high adjusted R^2 values, meaning their variance can be explained by other variables, but because these are important variables to this analysis, they should be kept as predictors in the model.

Variable Clustering

Variable clustering can also be used to see if there are any strong clusters within the variables of interest.

#clustering using hoeffding D
vc <- varclus(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), 
              data = predictors, sim = 'hoeffding')

#clustering using spearman
vc2 <- varclus(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), 
              data = predictors, sim = 'spearman')

#joining the two cluster analyses
par(mfrow=c(2,1)) 
plot(vc)
plot(vc2)

Both methods of clustering, Hoeffding D and Spearman \(\rho^2\) produce similar results. v_fhu00 (visceral adipose tissue density) and sa_thu00 (subcutaneous adipose tissue density) cluster together with a high correlation, however as these are both important body composition variables in this analysis they will both be kept as predictors for modeling. They also cluster with vat00, which is visceral adipose tissue. weight and sat00 also cluster together which makes sense as sat00 is a body fat measure which could be related to weight. weight can be removed as all the body composition variables measured from CT scans are important to keep and weight is highly related to at least some of those measures.

predictors <- predictors %>% select(-weight)

height also seems to be highly correlated with sex. This makes sense as males are usually taller than females. Because height can be age-dependent in the elderly, its good to check if that is the case within this set of patients.

d <- datadist(predictors)
options(datadist="d")
f <- orm(height ~ rcs(age,4)*sex, data = predictors)
qu <- Quantile(f); med <- function(x) qu(.5, x)
ggplot(Predict(f, age, sex, fun=med, conf.int=FALSE), ylab='Predicted Median Height, cm')

For patients younger than 40, there seems to be odd behavior but the average is fairly flat for all patients above 40 meaning there is no relationship with height decreasing in the elderly. At this stage in the process, it makes sense to include both height and sex.

Final vairables to include in the modeling process

The final variables being included as possible predictors for modeling are:

age, height, ascites_yn, he, race, sex, smoker, cit, etiology, meld, acr, surgeryduration, sma00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00, copd, htn, dm, cad, ckd, ctd, time, prev_state

#adding outcome variable to data set
data <- cbind(predictors,state = processed_data$state, patient_mrn =  processed_data$patient_mrn)

#filtering out day 0 with no previous state
data <- data %>% filter(!is.na(prev_state))

Data Imputation

Even though this is a pretty large data set, because the data only contains 422 patients, its important to keep every single patient to train a model. Therefore, missing values need to be imputed.

gg_miss_var(data, show_pct=TRUE) + labs(title = "Percent of Missing Data")

As shown similarly during variable selection as well as in the plot above, there are only a small percentage of missing values, less than 1% for any variable which is really promising. Multiple imputation using predictive mean matching with 5 iterations of imputation is used to impute missing values. All variables including the outcome variable are used to impute all other variables in order to get the best imputation results.

set.seed(578)

data$state <- factor(data$state,
levels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

data$prev_state <- factor(data$prev_state,
levels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

a <- aregImpute(~ age + height + ascites_yn + he + race + sex + smoker + cit +
                etiology + meld + acr + surgeryduration + sma00 + vat00 + sat00 + 
                  s_mhu00 + v_fhu00 + sa_thu00 + htn + dm + ckd + 
      copd + cad + ctd + prev_state + time + state + patient_mrn,
    data = data, n.impute = 5, x = TRUE)

data <- as.data.frame(a[['x']])

convert <- c('ascites_yn', 'he', 'race', 'sex', 'smoker', 'etiology', 'acr', 'copd', 
             'htn', 'dm', 'cad', 'ckd', 'ctd')

for (var in convert){
  data[var] <- factor(data[[var]], )
}

data$state <- factor(data$state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

data$prev_state <- factor(data$prev_state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))

Viewing the State Transition Over Time

Event Chart

Here is an event chart for a random sample of patients removing all days post death if this state occurs in the random sample. This just gives a flavor of what states patients go through over time and shows trends in the data. As you can see, all patients start in the ICU and most end at home or in an inpatient long term facility. The black square at the bottom of the plot indicates that patient died on day 9 and all remaining observations are removed as death is an absorbing state.

set.seed(218)
ssamp <- sample(unique(data$patient_mrn), 30, FALSE)
dr <- subset(data, patient_mrn %in% ssamp)
dr <- subset(dr, prev_state != 'Dead')
dr <- subset(dr, time <= 61)
dr$id <- as.integer(as.factor(dr$patient_mrn))
dr$status <- factor(dr$state, levels=rev(levels(dr$state)))
dr$time <- dr$time - 1

datad <- datadist(dr)
options(datadist="datad")
multEventChart(state ~ time + id, data=dr,
               absorb='Dead', sortbylast = FALSE) +
  theme_classic() +
  theme(legend.position='bottom') +
  labs(title = "State Transitions for 60 Days Post Liver Transplant", y = "Days") +
  scale_y_continuous(
    breaks = seq(0,60,5))

The plot below shows a summary of successive state transitions over the first 30 days after liver transplant. All patients start in the ICU on day 0. As you can see here on day 1 already many patients move out of the ICU and into a different ward in the hospital. By the end of this time period the most common transition is from hospital to home/IPR which makes sense as patients are discharged after recovering from their liver transplant.

d <- data %>% filter(time < 31)

propsTrans(state ~ time + patient_mrn, data=d, maxsize=6) +
  theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))

Checking the Proportional Odds Assumption

The Proportional Odds model relies on the assumption that there are proportional odds between different states. To see whether the proportional odds assumption is a reasonable assumption to make based on the data, we can look at some plots and check if there is parallelism between states.

Proportion of Each State Estimated from Day 1

Here is a plot of the proportion of each state over time. As mentioned before, on day one (day after surgery transplant) most patients are still in the ICU with only a few in the hospital (not ICU). The small pink band at the top shows deaths. This area will always be increasing as death is an absorbing state. By day 65, all patients have been either discharged to home or a facility or have passed away.

propsPO(state ~ time, data=data, nrow=1) +
  labs(title = "Proportion of Patients in Each State Over Time", x = "Time") +
  theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))

To understand if the proportional odds assumption holds, we can look at the observed proportions on top of the expected proportions assuming the proportional odds assumption been satisfied for comparison. This was computed using time as a sole predictor as a flexible spline. The odds ratios are computed at day 1 and then applied to the observed proportions on day 1.

datad <- datadist(data)
options(datadist="datad")
f <- lrm(state ~ rcs(time, 5), data=data)
g <- Function(f)   # derive predicted log odds as a function of day
# Make another function that gets the OR vs. day=1
dor <- function(time) exp(g(time) - g(1))
propsPO(state ~ time, odds.ratio=dor, data=data) +
  theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))

With respect to time, the proportional odds assumption does not seem to hold. This was done carrying death forward. Although there are very little deaths in the subset of the first 34 days in this plot, as death is an absorbing state, it makes sense to terminate follow-up after death.

Here is the same plot as above but with no observations for a patient after they die.

dd <- subset(data, prev_state != 'Dead')
f2 <- orm(state ~ rcs(time, 5), data=dd)
g2 <- Function(f2)   # derive predicted log odds as a function of day
# Make another function that gets the OR vs. day=1
dor <- function(time) exp(g2(time) - g2(1))
propsPO(state ~ time, odds.ratio=dor, data=dd) +
  theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))

After terminating follow-up after death, the proportional odds assumption seems far from true. This makes sense as all patients begin on day 0 in the ICU post liver transplant and therefore state is highly dependent on time.

Although the proportional odds model relies on the PO assumption and this data set does not meet this assumption, we can still get some beneficial results from a PO model. See here for details on why it is okay to use the proportional odds model when the PO assumption has not been met.

As death is absorbing, it isn’t important to predict death once a patient is already dead. That is not interesting and will only bias the interpretation of predicting states up until the day of death. For this reason, all observations after a patient has died are removed for the remainder of this analysis.

#truncate after death
data <- subset(data, prev_state != 'Dead')
data$prev_state <- factor(data$prev_state,
levels = c("Home/IPR", "Hospital", "ICU/Vent"))

Proportional Odds Modeling

Allocating DF to predictors

The following formula from Section 4.4 of Regression Modeling Strategies was used to calculate the effective sample size and understand how many parameters a model can have without having issues of overfitting.

\[ \begin{aligned} n - \frac{1}{n^2}\sum_{i=1}^{k}n_{i}^3 \end{aligned} \]

table(data$state) %>% kbl(align = 'c') %>% kable_minimal
Var1 Freq
Home/IPR 34135
Hospital 3526
ICU/Vent 1483
Dead 9

\[ \begin{aligned} 39153 - \frac{1}{39153^2}\sum_{i=1}^{4}n_{i}^3 = 39153 - \frac{1}{39153^2}(34135^3 + 3524^3 + 1485^3 + 9^3) = 39152.34 \end{aligned} \]

This is essentially the same as the sample size which gives a lot of flexibility for modeling. For example, using the 15:1 rule of thumb as a very basic guideline, the model could have \(39152.34/15 = 2610\) parameters and still satisfy the rule. This allows for a very flexible model with splines and interactions without concern.

To understand the relative importance of predictors, in order to allocate degrees of freedom and give more flexibility to the most important predictors, a model can be fit adjusting the variance-covariance matrix for within-patient correlation using the robust cluster sandwich covariance estimator. Continuous covariates have nonlinear effects using restricted cubic splines with 5 knots.

data$patient_mrn <- as.character(data$patient_mrn)

f <- orm(state ~ ctd + sex + rcs(age,5) + rcs(sma00,5) + rcs(height,5) + ascites_yn + he +
           race + smoker + rcs(cit,5) + etiology + rcs(meld,5) + acr + 
    rcs(surgeryduration,5) + rcs(vat00,5) + rcs(sat00,5) + 
    rcs(s_mhu00,5) + rcs(v_fhu00,5) + rcs(sa_thu00,5) + 
      htn + dm + ckd + rcs(time,5) + prev_state, data=data, x=TRUE, y=TRUE, maxit = 30)
g <- robcov(f, data$patient_mrn)
plot(anova(g), trans=sqrt)

prev_state, meld, time and surgeryduration have the highest \(\chi^2\) values. These variables can be allocated more degrees of freedom, specifically using splines for continuous variables. There is flexibility to use splines on many predictors as the sample size is large enough to allow for more parameters.

GEE Model for PO Model

Here is a model built to predict state for each day post liver transplant up to 90 days terminating follow-up after death. In addition to using splines to relax the linearity assumption, interaction terms can be added to the model. Based on domain knowledge, age and sex could both interact with body composition. Because sma00 (skeletal muscle area) had some correlation with sex, found in the variable clustering above, an interaction term is added between these two variables. Age can also interact with sma00 as skeletal muscle area can change differently with age.

The model is a PO model using the cluster sandwich covariance estimator to adjust the variance-covariance matrix for within-patient correlation.

m <- orm(state ~ ctd + (sex + rcs(age,5) + rcs(sma00,4))^2 - rcs(age,5):sex +
           rcs(height,3) + ascites_yn + he +race + smoker + rcs(cit,3) + etiology +
           rcs(meld,5) + acr + rcs(surgeryduration,5) + rcs(vat00,4) + rcs(sat00,3) + 
           rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,4) +
           htn + dm + ckd + rcs(time,5) + prev_state, 
           data=data, x=TRUE, y=TRUE, maxit = 30)
g <- robcov(m, data$patient_mrn)
print(g)
Logistic (Proportional Odds) Ordinal Regression Model
 orm(formula = state ~ ctd + (sex + rcs(age, 5) + rcs(sma00, 4))^2 - 
     rcs(age, 5):sex + rcs(height, 3) + ascites_yn + he + race + 
     smoker + rcs(cit, 3) + etiology + rcs(meld, 5) + acr + rcs(surgeryduration, 
     5) + rcs(vat00, 4) + rcs(sat00, 3) + rcs(s_mhu00, 3) + rcs(v_fhu00, 
     3) + rcs(sa_thu00, 4) + htn + dm + ckd + rcs(time, 5) + prev_state, 
     data = data, x = TRUE, y = TRUE, maxit = 30)
 
Frequencies of Responses
 Home/IPR Hospital ICU/Vent     Dead 
    34135     3526     1483        9 
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 39153 LR χ2 31342.15 R2 0.913 ρ 0.576
Distinct Y 4 d.f. 66 g 4.224
Cluster on data$patient_mrn Pr(>χ2) <0.0001 gr 68.282
Clusters 422 Score χ2 45993.60 |Pr(Y ≥ median)-½| 0.489
Y0.5 1 Pr(>χ2) <0.0001
max |∂log L/∂β| 2×10-5
β S.E. Wald Z Pr(>|Z|)
y≥Hospital   -2.7582   6.7917 -0.41 0.6847
y≥ICU/Vent  -10.0741   6.7857 -1.48 0.1376
y≥Dead  -16.9129   6.7853 -2.49 0.0127
ctd=2   0.0885   0.3752 0.24 0.8135
sex=2   1.3584   1.1304 1.20 0.2295
age   -0.1654   0.1488 -1.11 0.2664
age'   1.0730   0.4378 2.45 0.0142
age''  -22.9066   7.4872 -3.06 0.0022
age'''   52.9194  16.5439 3.20 0.0014
sma00   -0.0247   0.0450 -0.55 0.5825
sma00'   0.0034   0.1497 0.02 0.9818
sma00''   -0.0211   0.4576 -0.05 0.9633
height   -0.0070   0.0086 -0.81 0.4204
height'   0.0045   0.0091 0.49 0.6233
ascites_yn=2   -0.0717   0.1214 -0.59 0.5550
he=2   0.1558   0.0896 1.74 0.0822
race=2   -0.1439   0.1277 -1.13 0.2598
smoker=2   -0.2583   0.1125 -2.30 0.0216
cit   -0.0528   0.0428 -1.23 0.2175
cit'   0.0446   0.0534 0.84 0.4037
etiology=2   0.2410   0.1122 2.15 0.0317
etiology=3   0.2149   0.1144 1.88 0.0604
etiology=4   0.1508   0.1375 1.10 0.2728
etiology=5   -0.1083   0.1297 -0.84 0.4035
meld   0.0869   0.0272 3.19 0.0014
meld'   -0.2768   0.1518 -1.82 0.0683
meld''   1.0900   0.5937 1.84 0.0664
meld'''   -1.5743   0.8682 -1.81 0.0698
acr=2   0.1596   0.1135 1.41 0.1599
surgeryduration   -0.0008   0.0031 -0.27 0.7862
surgeryduration'   -0.0317   0.0315 -1.01 0.3132
surgeryduration''   0.1333   0.0999 1.33 0.1823
surgeryduration'''   -0.1741   0.1129 -1.54 0.1231
vat00   0.0006   0.0034 0.18 0.8571
vat00'   -0.0007   0.0168 -0.04 0.9644
vat00''   0.0024   0.0359 0.07 0.9466
sat00   -0.0006   0.0010 -0.61 0.5449
sat00'   0.0012   0.0011 1.11 0.2651
s_mhu00   -0.0103   0.0088 -1.17 0.2429
s_mhu00'   0.0069   0.0101 0.68 0.4950
v_fhu00   -0.0097   0.0121 -0.81 0.4201
v_fhu00'   0.0024   0.0140 0.17 0.8621
sa_thu00   0.0198   0.0153 1.29 0.1955
sa_thu00'   -0.0325   0.0586 -0.56 0.5788
sa_thu00''   0.0720   0.1341 0.54 0.5915
htn=2   0.0767   0.0749 1.02 0.3057
dm=2   0.1927   0.0893 2.16 0.0311
ckd=2   0.2477   0.0856 2.89 0.0038
time   0.0007   0.0094 0.07 0.9407
time'   0.0839   0.0839 1.00 0.3179
time''   -0.6894   0.3443 -2.00 0.0453
time'''   1.6983   0.6257 2.71 0.0066
prev_state=Hospital   10.2493   0.4679 21.91 <0.0001
prev_state=ICU/Vent   16.8385   0.6320 26.64 <0.0001
sex=2 × sma00   -0.0126   0.0091 -1.39 0.1644
sex=2 × sma00'   0.0769   0.0362 2.12 0.0337
sex=2 × sma00''   -0.2512   0.1161 -2.16 0.0304
age × sma00   0.0011   0.0011 0.98 0.3295
age' × sma00   -0.0082   0.0034 -2.41 0.0160
age'' × sma00   0.1810   0.0595 3.04 0.0024
age''' × sma00   -0.4241   0.1329 -3.19 0.0014
age × sma00'   -0.0024   0.0037 -0.64 0.5203
age' × sma00'   0.0254   0.0112 2.27 0.0229
age'' × sma00'   -0.5934   0.2066 -2.87 0.0041
age''' × sma00'   1.4898   0.4968 3.00 0.0027
age × sma00''   0.0070   0.0113 0.62 0.5330
age' × sma00''   -0.0707   0.0317 -2.23 0.0258
age'' × sma00''   1.6853   0.6019 2.80 0.0051
age''' × sma00''   -4.3450   1.5053 -2.89 0.0039
anova(g) %>% kbl(align = 'c') %>% kable_minimal
Chi-Square d.f. P
ctd 0.0556328 1 0.8135366
sex (Factor+Higher Order Factors) 6.2147866 4 0.1836718
All Interactions 6.0870577 3 0.1074506
age (Factor+Higher Order Factors) 21.9121324 16 0.1460514
All Interactions 19.6799976 12 0.0733867
Nonlinear (Factor+Higher Order Factors) 17.3754001 12 0.1360133
sma00 (Factor+Higher Order Factors) 38.5155286 18 0.0033084
All Interactions 27.6508203 15 0.0238598
Nonlinear (Factor+Higher Order Factors) 17.7467359 12 0.1235997
height 0.7462041 2 0.6885949
Nonlinear 0.2412018 1 0.6233394
ascites_yn 0.3484406 1 0.5549972
he 3.0213216 1 0.0821765
race 1.2699451 1 0.2597770
smoker 5.2751429 1 0.0216320
cit 2.1911074 2 0.3343544
Nonlinear 0.6973596 1 0.4036723
etiology 10.9696688 4 0.0269070
meld 76.3896162 4 0.0000000
Nonlinear 4.8877410 3 0.1802043
acr 1.9749246 1 0.1599261
surgeryduration 12.1701093 4 0.0161301
Nonlinear 10.0172735 3 0.0184199
vat00 0.6702529 3 0.8801774
Nonlinear 0.0429306 2 0.9787634
sat00 2.8046040 2 0.2460299
Nonlinear 1.2420024 1 0.2650855
s_mhu00 1.6187036 2 0.4451465
Nonlinear 0.4656927 1 0.4949754
v_fhu00 1.0336717 2 0.5964047
Nonlinear 0.0301790 1 0.8620847
sa_thu00 5.6777162 3 0.1283873
Nonlinear 0.3153140 2 0.8541427
htn 1.0491994 1 0.3056915
dm 4.6496299 1 0.0310602
ckd 8.3737018 1 0.0038069
time 14.8060627 4 0.0051208
Nonlinear 12.4742381 3 0.0059232
prev_state 779.7482188 2 0.0000000
sex * sma00 (Factor+Higher Order Factors) 6.0870577 3 0.1074506
Nonlinear 4.6917607 2 0.0957629
Nonlinear Interaction : f(A,B) vs. AB 4.6917607 2 0.0957629
age * sma00 (Factor+Higher Order Factors) 19.6799976 12 0.0733867
Nonlinear 15.7527076 11 0.1505627
Nonlinear Interaction : f(A,B) vs. AB 15.7527076 11 0.1505627
f(A,B) vs. Af(B) + Bg(A) 10.1378681 6 0.1189649
Nonlinear Interaction in age vs. Af(B) 13.4354105 9 0.1438736
Nonlinear Interaction in sma00 vs. Bg(A) 10.5941753 8 0.2257705
TOTAL NONLINEAR 75.9768483 36 0.0001117
TOTAL INTERACTION 27.6508203 15 0.0238598
TOTAL NONLINEAR + INTERACTION 77.8454151 38 0.0001472
TOTAL 1235.5124448 66 0.0000000
plot(anova(g), sort = c("ascending"))

prev_state is by far the strongest predictor of state. This makes sense as overtime the status of a patient becomes more stable and does not change much. Therefore, it becomes increasingly likely that the state the previous day is the same as the current state today. The next strongest predictors are meld, sma00 and time.

It is fairly easy to understand each parameter and how important each predictor is in the model, however the overall model is pretty complex with 72 degrees of freedom and many predictors in the form of splines.

Fast Backward Selection

Fast backward selection was performed to see if any additional predictors should be removed from the model.

print(fastbw(g, rule=c('p')), estimates=FALSE)
## 
##  Deleted     Chi-Sq d.f. P      Residual d.f. P      AIC   
##  vat00       0.67   3    0.8802  0.67     3   0.8802  -5.33
##  ctd         0.06   1    0.8055  0.73     4   0.9475  -7.27
##  height      0.53   2    0.7688  1.26     6   0.9740 -10.74
##  ascites_yn  0.18   1    0.6713  1.44     7   0.9844 -12.56
##  htn         0.82   1    0.3665  2.25     8   0.9723 -13.75
##  sma00       3.60   3    0.3075  5.86    11   0.8828 -16.14
##  s_mhu00     2.67   2    0.2629  8.53    13   0.8076 -17.47
##  race        1.00   1    0.3184  9.52    14   0.7961 -18.48
##  cit         2.02   2    0.3646 11.54    16   0.7749 -20.46
##  sex         0.93   1    0.3337 12.48    17   0.7705 -21.52
##  sex * sma00 4.31   3    0.2301 16.78    20   0.6669 -23.22
##  he          1.85   1    0.1738 18.63    21   0.6086 -23.37
##  sat00       4.56   2    0.1021 23.20    23   0.4492 -22.80
##  acr         3.93   1    0.0474 27.13    24   0.2985 -20.87
##  time        9.95   4    0.0414 37.07    28   0.1172 -18.93
## 
## Factors in Final Model
## 
##  [1] age             smoker          etiology        meld           
##  [5] surgeryduration v_fhu00         sa_thu00        dm             
##  [9] ckd             prev_state      age * sma00

Many variables were removed from the model in this process. sex was removed, but based on previous studies where sex was of importance, it is important to keep it in the model. Due to the importance of body composition variables to this analysis, all body composition variables were kept as well. However the number of knots for some of these variables can be reduced as they are not as important and will make the model less complex.

The list of variables that are removed based on the fast backward variable selection and domain knowledge are ascites_yn, ctd, htn, cit, race, acr and height. A new model was fit with these predictors removed.

fin <- orm(state ~ (sex + rcs(age,3) + rcs(sma00,3))^2 - rcs(age,3):sex +
           he + etiology + rcs(meld,5) + rcs(surgeryduration,3) + rcs(vat00,3) + 
            rcs(sat00,3) + rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,3) +
           dm + ckd + rcs(time,5) + prev_state, 
           data=data, x=TRUE, y=TRUE, maxit = 30)
final <- robcov(fin, data$patient_mrn)
print(final)
Logistic (Proportional Odds) Ordinal Regression Model
 orm(formula = state ~ (sex + rcs(age, 3) + rcs(sma00, 3))^2 - 
     rcs(age, 3):sex + he + etiology + rcs(meld, 5) + rcs(surgeryduration, 
     3) + rcs(vat00, 3) + rcs(sat00, 3) + rcs(s_mhu00, 3) + rcs(v_fhu00, 
     3) + rcs(sa_thu00, 3) + dm + ckd + rcs(time, 5) + prev_state, 
     data = data, x = TRUE, y = TRUE, maxit = 30)
 
Frequencies of Responses
 Home/IPR Hospital ICU/Vent     Dead 
    34135     3526     1483        9 
 
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 39153 LR χ2 31314.66 R2 0.913 ρ 0.576
Distinct Y 4 d.f. 40 g 4.173
Cluster on data$patient_mrn Pr(>χ2) <0.0001 gr 64.885
Clusters 422 Score χ2 45978.58 |Pr(Y ≥ median)-½| 0.489
Y0.5 1 Pr(>χ2) <0.0001
max |∂log L/∂β| 2×10-5
β S.E. Wald Z Pr(>|Z|)
y≥Hospital   -8.0451  3.3677 -2.39 0.0169
y≥ICU/Vent  -15.3360  3.3775 -4.54 <0.0001
y≥Dead  -22.1453  3.3857 -6.54 <0.0001
sex=2   -0.3043  0.7264 -0.42 0.6753
age   0.0085  0.0616 0.14 0.8907
age'   -0.0782  0.0709 -1.10 0.2703
sma00   -0.0005  0.0212 -0.02 0.9829
sma00'   -0.0205  0.0240 -0.85 0.3935
he=2   0.1546  0.0911 1.70 0.0897
etiology=2   0.2102  0.1091 1.93 0.0540
etiology=3   0.1894  0.1230 1.54 0.1236
etiology=4   0.1987  0.1447 1.37 0.1699
etiology=5   -0.0597  0.1139 -0.52 0.6002
meld   0.0872  0.0259 3.37 0.0008
meld'   -0.2959  0.1570 -1.88 0.0594
meld''   1.1717  0.6187 1.89 0.0582
meld'''   -1.7115  0.8964 -1.91 0.0562
surgeryduration   -0.0034  0.0014 -2.41 0.0160
surgeryduration'   0.0047  0.0014 3.28 0.0010
vat00   -0.0002  0.0020 -0.09 0.9296
vat00'   0.0010  0.0026 0.37 0.7088
sat00   -0.0004  0.0010 -0.36 0.7192
sat00'   0.0007  0.0011 0.66 0.5083
s_mhu00   -0.0076  0.0083 -0.91 0.3621
s_mhu00'   0.0042  0.0100 0.42 0.6730
v_fhu00   -0.0057  0.0137 -0.42 0.6771
v_fhu00'   -0.0051  0.0146 -0.35 0.7280
sa_thu00   0.0106  0.0092 1.16 0.2478
sa_thu00'   0.0011  0.0116 0.09 0.9260
dm=2   0.1950  0.0862 2.26 0.0238
ckd=2   0.2232  0.0814 2.74 0.0061
time   0.0093  0.0097 0.96 0.3386
time'   0.0533  0.0821 0.65 0.5157
time''   -0.6073  0.3362 -1.81 0.0708
time'''   1.6001  0.6199 2.58 0.0098
prev_state=Hospital   10.3355  0.4697 22.01 <0.0001
prev_state=ICU/Vent   16.9626  0.6413 26.45 <0.0001
sex=2 × sma00   0.0019  0.0052 0.36 0.7176
sex=2 × sma00'   0.0016  0.0086 0.18 0.8575
age × sma00   -0.0002  0.0004 -0.38 0.7012
age' × sma00   0.0007  0.0005 1.37 0.1695
age × sma00'   0.0005  0.0005 1.00 0.3168
age' × sma00'   -0.0009  0.0006 -1.34 0.1789

This model has a very high adjusted \(R^2\) of 0.913, meaning 91.3% of the variance in the state is explained by the model. It also has a \(\rho\) value of 0.576 which is a measure of rank discrimination or how well the model can differentiate different levels of state. the overall LR \(X^2\) is also very high with a value of 31,314.66 which shows there is definitely a significant association of at least one predictor in the model with the outcome state.

Model Validation

Now that I have chosen my final model, I want to validate that model to check for overfitting using bootstrap validation. Because I performed fast backward selection, I must penalize this variable selection process so that my standard errors and confidence intervals are not too small.

#could only get the code to work on subset of data with 65 days post liver transplant instead of 
# the full 90 days used in the rest of the analysis
sub_data = subset(data, time < 66)
sub <- orm(state ~ (sex + rcs(age,3) + rcs(sma00,3))^2 - rcs(age,3):sex +
           he + etiology + rcs(meld,5) + rcs(surgeryduration,3) + rcs(vat00,3) + 
            rcs(sat00,3) + rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,3) +
           dm + ckd + rcs(time,5) + prev_state, 
           data=sub_data, x=TRUE, y=TRUE, maxit = 30)
subset <- robcov(sub, sub_data$patient_mrn)

set.seed(1993)
v <- validate(subset, B=100, estimates=FALSE, bw=TRUE, rule='p', tol=1e-17, maxit=40)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted     Chi-Sq d.f. P      Residual d.f. P      AIC   
##  vat00       0.42   2    0.8123  0.42     2   0.8123  -3.58
##  sex * sma00 0.95   2    0.6216  1.37     4   0.8499  -6.63
##  sex         0.03   1    0.8733  1.39     5   0.9252  -8.61
##  s_mhu00     1.13   2    0.5697  2.52     7   0.9258 -11.48
##  sat00       2.01   2    0.3668  4.52     9   0.8737 -13.48
##  sma00       2.65   2    0.2664  7.17    11   0.7852 -14.83
##  age         3.79   2    0.1504 10.96    13   0.6143 -15.04
##  etiology    7.92   4    0.0946 18.88    17   0.3356 -15.12
##  he          2.71   1    0.0999 21.58    18   0.2510 -14.42
##  v_fhu00     5.64   2    0.0595 27.23    20   0.1290 -12.77
##  dm          3.15   1    0.0760 30.37    21   0.0847 -11.63
## 
## Factors in Final Model
## 
## [1] meld            surgeryduration sa_thu00        ckd            
## [5] time            prev_state      age * sma00
v
##       index.orig training   test optimism index.corrected   n
## rho       0.6687   0.6688 0.6685   0.0002          0.6684 100
## R2        0.9138   0.9147 0.9133   0.0013          0.9124 100
## Slope     1.0000   1.0000 0.9779   0.0221          0.9779 100
## g         4.9219   5.0664 4.9443   0.1221          4.7998 100
## pdm       0.4844   0.4845 0.4843   0.0001          0.4842 100
## 
## Factors Retained in Backwards Elimination
## 
##  sex age sma00 he etiology meld surgeryduration vat00 sat00 s_mhu00 v_fhu00
##                *           *    *                                          
##  *                         *                                               
##          *     *           *    *                                          
##      *                     *                                               
##                            *    *                     *     *              
##                *           *    *                                          
##      *         *           *    *                                          
##                *  *        *    *                                          
##                            *                    *                   *      
##  *                         *    *               *                          
##                            *    *                                   *      
##      *                     *                                *              
##          *                 *    *                           *       *      
##                   *        *    *               *                          
##                            *                                               
##                *  *        *    *                     *     *       *      
##  *       *        *        *    *                                          
##      *                     *    *                                          
##      *         *           *    *               *                          
##                            *    *                           *              
##          *     *  *        *    *                                          
##                   *        *                                        *      
##      *                     *    *                                          
##                            *                                               
##                   *        *    *                     *                    
##                            *    *               *                   *      
##                            *    *               *                          
##      *   *        *        *    *                                   *      
##  *                         *                                               
##                   *        *                                        *      
##                *  *        *    *                                          
##      *                     *    *                                   *      
##                            *                                        *      
##                   *        *    *               *                   *      
##                   *        *                    *                          
##                            *    *                                   *      
##                            *    *                                   *      
##                            *    *               *                          
##                            *    *                                          
##                            *    *               *                          
##      *                     *    *                           *       *      
##      *   *                 *                                               
##      *            *        *                          *                    
##                *           *                                               
##                *  *        *                                        *      
##                *  *        *    *                                          
##                            *    *                                          
##          *                 *    *                                   *      
##                            *    *                                          
##                *  *        *    *                     *                    
##                   *        *    *               *           *              
##                *  *        *    *                                   *      
##                            *    *               *           *              
##      *            *        *    *               *     *                    
##      *                     *    *                                          
##      *                     *                                               
##      *   *        *        *    *                     *                    
##          *     *           *    *                                          
##                            *    *                                          
##                            *                    *                          
##                   *        *    *               *                          
##                            *                                        *      
##                   *        *    *                                          
##      *   *                 *    *                           *              
##      *            *        *    *               *           *              
##                            *    *                                          
##          *                 *    *                     *                    
##          *                 *    *                                   *      
##                *           *    *                                          
##      *   *        *        *                          *             *      
##                            *    *               *                          
##          *                 *    *                                   *      
##                *           *    *                           *              
##                   *        *                                        *      
##                *           *    *                                          
##                *  *        *    *                     *                    
##  *                         *    *               *                          
##                   *        *                          *     *              
##          *     *           *                    *                          
##                *           *    *                                          
##          *                 *                                               
##                *           *    *                                   *      
##                            *                    *                          
##                   *        *    *                                   *      
##                            *                                *              
##                   *        *    *                                          
##                            *                                               
##                *  *        *    *                                   *      
##                            *    *                           *              
##                   *        *    *               *                          
##                *           *    *                                          
##          *     *           *                                        *      
##                            *    *                                          
##      *                     *                                *              
##          *     *           *                                               
##                            *    *                           *              
##                            *                                *              
##                *           *                                               
##                            *    *               *     *     *              
##                *           *    *               *           *              
##  sa_thu00 dm ckd time prev_state sex * sma00 age * sma00
##              *   *    *                      *          
##              *   *    *          *           *          
##  *               *    *                                 
##  *           *   *    *                      *          
##           *  *   *    *          *                      
##  *               *    *                                 
##  *        *      *    *                      *          
##           *  *        *                      *          
##  *           *   *    *                      *          
##  *        *      *    *                      *          
##  *           *   *    *                      *          
##              *   *    *                      *          
##  *               *    *                      *          
##              *   *    *                      *          
##  *           *   *    *                      *          
##              *   *    *                      *          
##           *  *   *    *          *           *          
##              *   *    *                      *          
##           *  *   *    *          *           *          
##              *   *    *                      *          
##           *      *    *                      *          
##           *  *   *    *                                 
##  *               *    *                      *          
##  *        *  *   *    *                      *          
##  *        *  *   *    *                      *          
##  *           *   *    *                      *          
##  *        *      *    *                                 
##  *        *  *   *    *                      *          
##              *   *    *                                 
##  *        *  *   *    *                                 
##           *  *   *    *                      *          
##  *           *   *    *                      *          
##  *           *   *    *                      *          
##  *           *   *    *                      *          
##           *  *   *    *                                 
##  *        *  *   *    *                      *          
##  *               *    *                      *          
##              *   *    *                      *          
##           *  *   *    *                                 
##  *           *   *    *                      *          
##           *      *    *                      *          
##           *  *   *    *                      *          
##  *           *   *    *          *           *          
##  *           *        *                                 
##  *        *  *        *                      *          
##              *   *    *                      *          
##              *   *    *                      *          
##  *               *    *                                 
##  *        *      *    *                      *          
##              *   *    *                      *          
##  *        *  *        *                      *          
##  *        *  *   *    *                      *          
##  *               *    *                      *          
##  *           *   *    *                      *          
##  *           *   *    *                      *          
##           *  *   *    *                      *          
##  *        *  *   *    *                      *          
##                  *    *          *           *          
##           *  *   *    *                      *          
##              *   *    *                      *          
##  *        *      *    *                      *          
##  *               *    *                      *          
##              *   *    *                      *          
##  *        *  *   *    *                      *          
##                  *    *                      *          
##  *        *  *   *    *                      *          
##  *               *    *                      *          
##  *        *  *   *    *          *           *          
##              *        *                      *          
##  *           *   *    *                      *          
##  *        *  *   *    *                      *          
##  *        *  *   *    *                                 
##              *   *    *                      *          
##  *        *  *   *    *                                 
##  *           *   *    *                                 
##           *  *   *    *                      *          
##  *               *    *                                 
##              *   *    *                      *          
##  *           *   *    *                      *          
##                  *    *                      *          
##  *           *   *    *                                 
##  *               *    *                                 
##  *        *  *   *    *                      *          
##  *        *  *   *    *                      *          
##  *        *  *   *    *                      *          
##           *      *    *                      *          
##              *   *    *                                 
##  *           *   *    *                      *          
##           *      *    *                                 
##  *        *           *                      *          
##                  *    *                      *          
##  *           *   *    *                                 
##                  *    *                      *          
##           *      *    *                      *          
##              *   *    *                      *          
##              *   *    *                      *          
##              *   *    *                                 
##  *        *  *   *    *                                 
##                  *    *                                 
##              *   *    *                      *          
## 
## Frequencies of Numbers of Factors Retained
## 
##  4  5  6  7  8  9 10 11 12 
##  1  4 11 29 20 18  7  8  2
#couldn't figure out how to get a nice html version to print
#html(v)

Based on the low optimism for each index and the corrected \(R^2\), \(\rho\) and Slope, this model validates well and is not prone to very much overfitting. Just like backwards selection, validation shows that many predictors could be removed from the model. However, due to low risk of overfitting it is okay to keep these variables in my final model to understand the association and importance of all predictors.

Variable Importance

an2 <- anova(final)
an2 %>% kbl(align = 'c') %>% kable_minimal
Chi-Square d.f. P
sex (Factor+Higher Order Factors) 0.7494969 3 0.8615045
All Interactions 0.7490620 2 0.6876117
age (Factor+Higher Order Factors) 10.4353440 6 0.1074757
All Interactions 9.5183081 4 0.0493724
Nonlinear (Factor+Higher Order Factors) 4.8486817 3 0.1832187
sma00 (Factor+Higher Order Factors) 22.1511640 8 0.0046434
All Interactions 10.0150199 6 0.1240209
Nonlinear (Factor+Higher Order Factors) 2.0438385 4 0.7276959
he 2.8803800 1 0.0896649
etiology 8.4868835 4 0.0752858
meld 72.0974262 4 0.0000000
Nonlinear 7.4690505 3 0.0583590
surgeryduration 14.4345159 2 0.0007338
Nonlinear 10.7830563 1 0.0010243
vat00 0.4059105 2 0.8163148
Nonlinear 0.1394480 1 0.7088304
sat00 1.0408100 2 0.5942798
Nonlinear 0.4374906 1 0.5083362
s_mhu00 1.1059662 2 0.5752313
Nonlinear 0.1781522 1 0.6729661
v_fhu00 1.2185864 2 0.5437351
Nonlinear 0.1209744 1 0.7279801
sa_thu00 4.7485704 2 0.0930810
Nonlinear 0.0086290 1 0.9259891
dm 5.1118621 1 0.0237628
ckd 7.5139638 1 0.0061222
time 13.9236378 4 0.0075428
Nonlinear 13.6332339 3 0.0034494
prev_state 758.6366151 2 0.0000000
sex * sma00 (Factor+Higher Order Factors) 0.7490620 2 0.6876117
Nonlinear 0.0322579 1 0.8574629
Nonlinear Interaction : f(A,B) vs. AB 0.0322579 1 0.8574629
age * sma00 (Factor+Higher Order Factors) 9.5183081 4 0.0493724
Nonlinear 2.1017528 3 0.5515582
Nonlinear Interaction : f(A,B) vs. AB 2.1017528 3 0.5515582
f(A,B) vs. Af(B) + Bg(A) 1.8064724 1 0.1789320
Nonlinear Interaction in age vs. Af(B) 1.9796534 2 0.3716411
Nonlinear Interaction in sma00 vs. Bg(A) 1.8979262 2 0.3871422
TOTAL NONLINEAR 42.8295340 18 0.0008458
TOTAL INTERACTION 10.0150199 6 0.1240209
TOTAL NONLINEAR + INTERACTION 46.8468179 20 0.0006160
TOTAL 1119.1294757 40 0.0000000

The previous state dominates the model with a \(\chi^2\) of 759. There is evidence of nonlinearity in the model as well as nonlinearity and interaction, but not for interaction alone. Interaction for age and skeletal muscle mass is significant based on \(\alpha = 0.05\), but no other interaction terms were sigificant. Predictors that have a nonlinear relationship with state include meld, surgeryduration and time.

There are many different body composition measures in this analysis. Since I want to understand how overall body composition effects state outcomes for each day during the 90 day period post liver transplant, I did a combined chunk test of all body composition variables as well as looking at each model predictor individually.

b2 <- anova(final, sma00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00)
b2 %>% kbl(align = 'c') %>% kable_minimal
Chi-Square d.f. P
sma00 (Factor+Higher Order Factors) 22.1511640 8 0.0046434
All Interactions 10.0150199 6 0.1240209
Nonlinear (Factor+Higher Order Factors) 2.0438385 4 0.7276959
vat00 0.4059105 2 0.8163148
Nonlinear 0.1394480 1 0.7088304
sat00 1.0408100 2 0.5942798
Nonlinear 0.4374906 1 0.5083362
s_mhu00 1.1059662 2 0.5752313
Nonlinear 0.1781522 1 0.6729661
v_fhu00 1.2185864 2 0.5437351
Nonlinear 0.1209744 1 0.7279801
sa_thu00 4.7485704 2 0.0930810
Nonlinear 0.0086290 1 0.9259891
TOTAL NONLINEAR 3.1342971 9 0.9587315
TOTAL 36.3458895 18 0.0063697

With a \(\chi^2\) of 36.3 and a p-value of 0.0063 there does seem to be an association with overall body composition and liver transplant outcome states. However, in looking at each body composition measure separately, skeletal muscle area (sma00) seems to be the only one to have an association with liver transplant outcome states.

Below we can see a plot of predictor importance taking into account nonlinearity and interactions as well as adding the measures for the overall body composition chunk test.

s2 <- rbind(an2, `body composition`=b2['TOTAL', ])
class(s2) <- 'anova.rms'
plot(s2, main = "Testing Statistical Importance", sort = c("ascending"))

Based on \(\alpha = 0.05\), the significant predictors are prev_state, MELD score (meld), skeletal muscle area (sma00), surgeryduration, time, chronic kidney disease (ckd), age and sma00 interaction, and diabetes (dm)

As expected, the most important parameter in my model is prev_state. The combined chunk test for body composition has a \(\chi^2\) = 36.3 and a p-value of 0.0064.

Partial Effect Plots

The below plots show the partial effects of each of the predictors in the final model on a mean scale.

ggplot(Predict(final), vnames='names', sepdiscrete='vertical')

The partial effects plots show a non-flat relationship with only time, prev_state and very slightly with meld. This makes sense as these are 3 of the most important predictors in the model. The effect size seems very small except for prev_state and time after 50 days meaning although there were many significant predictors that seem to have an association with outcome states, they have negligible effects on changing odds of outcome states based on changes in the predictors.

Estimating Baseline Covariate Effects

Below are plots showing the baseline covariate effects (log odds) for days 1, 15, 30, 45, 60, and 85 from the model for some of the most significant predictors.

t <-  c(1, 15, 30, 45, 60, 75)
ggplot(Predict(final, time = t, prev_state, fun=plogis))

Interestingly, overtime the odds of the previous state being ICU/Vent or Home/IPR don’t change, however, odds of being in the hospital seem to increase slightly at first and then decrease after 45 days post liver transplant.

ggplot(Predict(final, time = t, meld, fun=plogis))

ggplot(Predict(final, time = t, sma00, fun=plogis))

ggplot(Predict(final, time = t, surgeryduration, fun=plogis))

All three of the continuous predictors above seem to have wdie confidence intervals and negligible differences over time. As MELD scores increase, the odds increases at a very low amount and skeletal muscle area has the opposite effect. As surgery duration increases, log odds decreases slightly and then begins to increase but this also has large confidence intervals and such low effect of any change.

ggplot(Predict(final, time = t, ckd, fun=plogis))

ggplot(Predict(final, time = t, etiology, fun=plogis))

ggplot(Predict(final, time = t, dm, fun=plogis))

Confidence intervals seem to heavily overlap for all three categorical variables above without any interesting trends.

ggplot(Predict(final, time = t, age, fun=plogis))

Age also has extrememly low effect on changes in log odds. It seems for patients above 60 that log odds may increase slightly, but we can only say this with extremely low confidence and low effect size.

Interpreting Predictions

Here is a plot of the interquartile-range odds ratios for continuous predictors and simple odds ratios for categorical predictors. The numbers on the left axis next to each variable name indicate the upper and lower quartile ranges for continuous variables and the current versus reference group for categorical variables. The ranges on the left are in the original scale while the intervals that are plotted are on the log odds ratio scale.

plot(summary(final), log=TRUE)

This shows similar results to the partial effects plots above, as the only odds ratios not hovering around 1 are prev_state and time meaning no other predictors really influence any changes in log odds.

Converting Odds Ratios to Probabilities

The odds ratios output from the model can be converted to probabilities for ease of interpretation. You can look at any of the following probabilities.

  • P(Home/IPR) vs. P(Hospital, ICU/Ventilator, Dead)
    • P(Home/IPR, Hospital) vs. P(ICU/Ventilator, Dead)
    • P(Home/IPR, Hospital, ICU/Ventilator) vs. P(Dead)

Here is a plot of the probability of being at home or in an inpatient long term care facility averaging over all the patients for each day. The probability is extremely low at first as all patients start in the ICU post liver transplant and then it increases reaching practically 1 by around 50 days post transplant.

p <- predict(final, type='fitted.ind')
options(scipen = 50)
x <- as.data.frame(p)
y <- cbind(x, data)

prob_home <- y %>%
  group_by(time) %>%
  summarise_at(vars(`state=Home/IPR`), funs(mean(., na.rm=TRUE)))

ggplot(prob_home, aes(x = time, y = `state=Home/IPR`)) + geom_line() +
  labs(title = "Probability of Being at Home/IPR", x = "Day Post Liver Transplant", 
       y = "P(state = Home/IPR)")

Below is a plot of the mean, median and 90th percentile where skeletal muscle area is varied but all other predictors are held constant to either medians for continuous variables or modes for categorical variables. This plot is deceiving as it looks like as skeletal muscle mass increases the output log odds decreases, however in looking at the y-axis scale it shows it really only decreases by less than a thousandth.

M <- Mean(final, codes=TRUE)
qu <- Quantile(final, codes=TRUE)
med <- function(x) qu(.5 , x)
p90 <- function(x) qu(.9 , x)

pmean <- Predict(final, sma00 , fun=M, conf.int = FALSE )
pmed <- Predict(final, sma00 , fun=med , conf.int = FALSE )
p90 <- Predict(final, sma00 , fun=p90 , conf.int = FALSE )
z <- rbind('orm mean '=pmean , 'orm median '=pmed , 'orm P90 '=p90 )
ggplot(z, groups ='.set.',
adj.subtitle =FALSE , legend.label = FALSE )

Assessing Ordinality of State and Proportional Odds for each Predictor

The plots below examine the ordinality of state with respect to each predictor through assessing how levels of state vary related to the mean of each predictor. A monotonic relationship means that ordinality holds. The solid lines connect stratified means while the dotted lines connect the estimated expected values of \(X|Y = j\) give that the porportional odds assumption holds.

datad <- datadist(data)
options(datadist="datad")
plot.xmean.ordinaly(state ~ sex + age + sma00 + he + etiology + meld +
                      surgeryduration + vat00 + sat00 + s_mhu00 + v_fhu00 +
                      sa_thu00 + dm + ckd + time, data = data, cr=TRUE,
                    subn = FALSE, cex.points = 0.65)

Ordinality only seems to hold for surgeryduration and the proportional odds assumption does not seem to hold for any of the predictors. It holds fairly well for time over the first 3 states but then not for the state “Dead”. This is most likely due to only a small sample of patients dying in the first 90 days post liver transplant. The means for death are all extremely off their estimated expected values but due to lack of data for that category these means are hard to trust.

Below is an additional plot to try and understand if the proportional odds assumption holds for each of the predictors. The plus sign, triangle and circle correspond to different levels of state. When the proportional odds assumption holds, the differences in logits between different X values should be the same accross all levels of that predictor.

state2 = as.integer(data$state) - 1

sf =  function(y) {
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)))}
s1 = summary(state2 ~ sex + age + sma00 + he + etiology + meld + 
                      surgeryduration + vat00 + sat00 + s_mhu00 + v_fhu00 + 
                      sa_thu00 + dm + ckd + time + prev_state, fun = sf, data)
plot(s1, which = 1:3, pch = 1:3, xlab = 'logit', vnames = 'names', main = '', 
     width.factor = 1.5, xlim=c(-10, 0))

From this plot we can see that proportional odds does not seem to hold for many of the variables. time, and meld seem especially off at some levels, but sex, etiology and dm are the only predictors that seem to reliably satisfy proportional odds.

Does Predictor Importance Change with Different Time Periods?

Because patient’s states tend to stabilize with time, I am interested in running a simulation to understand whether the amount days post liver transplant in the data set impacts the importance of predictors in the model. This could also impact effect size. I hypothesize that the effect size of time will decrease as states are less stable closer to the liver transplant.

I will run a simulation of 100 bootstrap samples, stratified by patient, for each of 8 different end time points and analyze the odds ratios and \(\chi^2\) values for each number of days used in the model.

Although the model is not as flexible, for purposes of reducing complexity of this analysis, I used only linear predictors and no interaction for this simulation. This way each predictor will only have one coefficient in the model and it will be a bit easier to understand how that one coefficient changes with different numbers of days in the model.

lin <- orm(state ~ sex + age + sma00 + he + etiology + meld + surgeryduration + vat00 + 
              sat00 + s_mhu00 + v_fhu00 + sa_thu00 + dm + ckd + time + prev_state, 
             data=data, x=TRUE, y=TRUE, maxit = 40, tol=1e-17)
linear <- robcov(lin, data$patient_mrn)
an_l <- anova(linear)

N <- 100 #number of bootstrap samples
t = seq(20, 90, 10) #different periods of time post liver transplant

#df to capture for coefficients
col_names_coef_l <- append(names(linear$coefficients), c('days', 'N'))
df_coef_l <- data.frame(matrix(ncol = length(col_names_coef_l), nrow = 0))
colnames(df_coef_l) <- col_names_coef_l

#df to capture for testing
b_l <- as.data.frame(an_l)
b_l$var <- row.names(b_l)
col_names_stat_l <- append(colnames(b_l), c('days', 'N'))
df_stat_l <- data.frame(matrix(ncol = length(col_names_stat_l), nrow = 0))
colnames(df_stat_l) <- col_names_stat_l

patients <- data$patient_mrn %>% unique()

for (i in 1:N){
 
    #resampling by patient not observation
    ps <- sample(patients, replace = TRUE)
    s <- data.frame(matrix(ncol = length(colnames(data)), nrow = 0))
    colnames(s) <- colnames(data)

    for (p in ps){
      d <- data %>% filter(patient_mrn == p)
      s <- rbind(s, d)
    }
    
    for (j in t){
      #subset to data up to time t
      dat <- subset(s, time < j+1)
      
      #fit model
      l <- orm(state ~ sex + age + sma00 + he + etiology + meld + surgeryduration + vat00 + 
              sat00 + s_mhu00 + v_fhu00 + sa_thu00 + dm + ckd + time + prev_state, 
             data=data, x=TRUE, y=TRUE, maxit = 40, tol=1e-17)
      l2 <- robcov(l, data$patient_mrn)
      
      #model coefficients
      coef <- l2$coefficients
      c <- as.data.frame(transpose(as.list(coef)))
      c$days <- j
      c$N <- i
      df_coef_l <- rbind(df_coef_l, c)
      
      #stat testing
      a <- as.data.frame(anova(l2))
      a$var <- row.names(a)
      a$days <- j
      a$N <- i
      df_stat_l <- rbind(df_stat_l, a)
    }
  
  }

saveRDS(df_coef_l, 'SimCoef_linear.rds', compress='xz')
saveRDS(df_stat_l, 'SimStat_linear.rds', compress='xz')

Below are plots for all predictors showing the mean coefficient in the model as well as plots for all the predictors showing chi-squared values. These are for models fit across different amounts of days post-liver transplant. These results show that the importance and effect of predictors does not change depending on the number of days post liver transplant included in the analysis as all the plots show flat lines.

#read in simulation results
df_coef_l <- readRDS('SimCoef_linear.rds')
df_stat_l <- readRDS('SimStat_linear.rds')

#find mean odds ratios for each parameter
t = seq(20, 90, 10)

df_mean_coef_l <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df_mean_coef_l) <- c("mean", "var", "days")

for (i in t){
  
  df_l <- df_coef_l %>% filter(days == i)
  mean_df_l <- as.data.frame(colMeans(df_l))
  colnames(mean_df_l) <- c("mean")
  mean_df_l$var <- row.names(mean_df_l)
  mean_df_l$days <- i
  df_mean_coef_l <- rbind(df_mean_coef_l, mean_df_l)
  
}

vars <- mean_df_l$var %>% unique()

for (v in vars){

  g <- df_mean_coef_l %>% filter(var == v) %>% ggplot(aes(x = days, y = mean)) +
  geom_point() + labs(y = "Mean Odds Ratio", x = "Number of Days Included in Sample", 
                        title = paste0("Examining ", v))
  print(g)
}

vars <- df_stat_l$var %>% unique()

for (v in vars){

  g <- df_stat_l %>% filter(var == v) %>% ggplot(aes(x = days, y = `Chi-Square`)) + geom_point() + 
  labs(x = "Number of Days Included in Sample", title = paste0("Does Importance of ",v," Change?"))
  
  print(g)
}